Hilberg’s Conjecture: an Updated FAQ
Abstract
This note is a brief introduction to theoretical and experimental results concerning Hilberg’s conjecture, a hypothesis about natural language. The aim of the text is to provide a short guide to the literature.

1 What is Hilberg’s conjecture?

In the early days of information theory, Shannon (1951) published estimates of the conditional entropy of printed English. A few decades later, Hilberg (1990) replotted these estimates in doubly logarithmic scale and noticed that the entropy H(n) of n consecutive letters of text grows like

H(n) ≈ Bn^β + hn,    (1)

where the exponent β is close to 0.5 and the entropy rate h equals 0, cf. Cover and Thomas (2006) and Crutchfield and Feldman (2003). Although Shannon’s data points extended only to n ≤ 100 letters, Hilberg supposed that relationship (1) with h = 0 holds for any natural language and remains true for much larger n, n being the length of a random text or even longer. This hypothesis will be called the original Hilberg conjecture (Dębowski, 2014b).

Hilberg’s conjecture corroborates and strengthens Zipf’s preformal insight that texts produced by humans diverge from both pure randomness and pure determinism (Zipf, 1965, p. 187), since they are in a sense both partly random (H(n) > 0) and asymptotically deterministic (h = 0) (Dębowski, 2014b). The condition h = 0 is incompatible with the hypothesis of constant conditional entropy, recently proposed by cognitive scientists and critically examined by Ferrer-i-Cancho et al. (2013). Moreover, h = 0 implies that texts in natural language are asymptotically infinitely compressible. If relationship (1) is true with h = 0, we need an explanation of why texts cannot be fantastically well compressed by modern text compressors, such as the Lempel-Ziv code (Cover and Thomas, 2006). We will address this issue when answering Question 3.

However, if we do not believe in the asymptotic determinism of human utterances, we may still consider relationship (1) in which h = 0 does not necessarily hold. Such a relationship will be called the relaxed Hilberg conjecture. The relaxed Hilberg conjecture is equivalent to the statement that the mutual information between two adjacent blocks of text grows like a power of the block length.
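As a concrete illustration of relationship (1), the following minimal Python sketch (not taken from any of the cited papers) computes plug-in estimates of the block entropy H(n) for a text and fits the exponent β in doubly logarithmic scale under the assumption h = 0. The file name sample.txt is a placeholder, and plug-in estimates are severely biased for large n, so this illustrates the shape of the law rather than providing a rigorous test.

```python
# Minimal sketch: plug-in block entropy H(n) and a log-log fit of the
# exponent beta in H(n) ~ B * n**beta (assuming h = 0).
import math
from collections import Counter

def block_entropy(text: str, n: int) -> float:
    """Plug-in (maximum-likelihood) estimate of n-gram entropy, in bits."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    counts = Counter(grams)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

text = open("sample.txt", encoding="utf8").read()  # placeholder corpus
ns = [1, 2, 4, 8, 16]
hs = [block_entropy(text, n) for n in ns]

# Least-squares fit of log H(n) = log B + beta * log n.
xs = [math.log(n) for n in ns]
ys = [math.log(h) for h in hs]
mx = sum(xs) / len(xs)
my = sum(ys) / len(ys)
beta = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        / sum((x - mx) ** 2 for x in xs))
print(f"fitted exponent beta ~ {beta:.2f}")  # ~0.5 under Hilberg's conjecture
```

For a serious test, much better entropy estimators are needed, e.g. compression-based ones, as discussed in the works listed below.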
Similar works
Hilberg’s Conjecture — a Challenge for Machine Learning
We review three mathematical developments linked with Hilberg’s conjecture—a hypothesis about the power-law growth of entropy of texts in natural language, which sets up a challenge for machine learning. First, considerations concerning maximal repetition indicate that universal codes such as the Lempel-Ziv code may fail to efficiently compress sources that satisfy Hilberg’s conjecture. Second,...
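As a hedged illustration of the first point, one can watch how the compression rate of an off-the-shelf Lempel-Ziv coder decays with input length. The sketch below uses Python's zlib (a DEFLATE/LZ77 implementation) on growing prefixes of a placeholder file sample.txt; under the original conjecture with h = 0 the rate should keep falling indefinitely, whereas real compressors plateau.

```python
# Compression rate of zlib on growing prefixes of a text.
import zlib

data = open("sample.txt", "rb").read()  # placeholder corpus
for n in (1_000, 10_000, 100_000, len(data)):
    prefix = data[:n]
    rate = 8 * len(zlib.compress(prefix, 9)) / len(prefix)  # bits per input byte
    print(f"n = {len(prefix):>8}: {rate:.3f} bits/byte")
```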
Empirical Evidence for Hilberg’s Conjecture in Single-Author Texts
Hilberg’s conjecture is a statement that the mutual information between two adjacent blocks of text in natural language scales as n^β, where n is the block length. Previously, this hypothesis has been linked to Herdan’s law on the levels of word frequency and of text semantics. Thus it is worth a direct empirical test. In the present paper, Hilberg’s conjecture is tested for a selection of English...
A Preadapted Universal Switch Distribution for Testing Hilberg's Conjecture
Hilberg’s conjecture states that the mutual information between two adjacent long blocks of text in natural language grows like a power of the block length. The exponent in this hypothesis can be upper bounded using the pointwise mutual information computed for a carefully chosen code. The lower the compression rate, the better the bound, but there is a requirement that the code be universal...
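The idea can be sketched with zlib standing in for the carefully chosen code (the paper's preadapted switch distribution is not implemented here): the pointwise mutual information of a code C between adjacent blocks X and Y is computed as |C(X)| + |C(Y)| − |C(XY)| in bits, and its growth with the block length n gives an empirical handle on the exponent.

```python
# Code-based pointwise mutual information between adjacent blocks,
# with zlib as a stand-in compressor (NOT the switch distribution).
import zlib

def code_bits(data: bytes) -> int:
    """Length of the zlib code for `data`, in bits."""
    return 8 * len(zlib.compress(data, 9))

def code_mutual_information(data: bytes, n: int) -> int:
    """|C(X)| + |C(Y)| - |C(XY)| for adjacent n-byte blocks X, Y."""
    x, y = data[:n], data[n:2 * n]
    return code_bits(x) + code_bits(y) - code_bits(x + y)

data = open("sample.txt", "rb").read()  # placeholder corpus
for n in (1_000, 10_000, 100_000):
    print(f"n = {n:>7}: I_C(n) = {code_mutual_information(data, n)} bits")
```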
A New Universal Code Helps to Distinguish Natural Language from Random Texts
Using a new universal distribution called the switch distribution, we reveal a prominent statistical difference between a text in natural language and its unigram version. For the text in natural language, the cross mutual information grows as a power law, whereas for the unigram text, it grows logarithmically. In this way, we corroborate Hilberg’s conjecture and disprove an alternative hypothesis ...
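The contrast described here can be mimicked crudely by comparing the code-based mutual information of a text with that of a character-shuffled version, which preserves only unigram frequencies; zlib again stands in for the switch distribution, so the numbers are merely suggestive.

```python
# Text vs. its shuffled ("unigram") version: code-based mutual information.
import random
import zlib

def code_mi(data: bytes, n: int) -> int:
    """|C(X)| + |C(Y)| - |C(XY)| in bits for adjacent n-byte blocks, C = zlib."""
    bits = lambda b: 8 * len(zlib.compress(b, 9))
    return bits(data[:n]) + bits(data[n:2 * n]) - bits(data[:2 * n])

text = open("sample.txt", "rb").read()            # placeholder corpus
unigram = bytes(random.sample(text, len(text)))   # keeps only unigram statistics

for n in (1_000, 10_000, 100_000):
    print(f"n = {n:>7}: text {code_mi(text, n):>6} bits, "
          f"unigram {code_mi(unigram, n):>6} bits")
```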